An Examination of Variation in Rater Severity Over Time: A Study in Rater Drift

Authors

  • Mark Wilson
  • Sue Bennett
  • Gerry Shelton
Abstract

Ratings on performance tasks from two scoring sessions of an eighth-grade mathematics examination, developed by the California State Department of Education, were used (a) to study the feasibility of estimating IRT rater severity information within a scoring session, (b) to investigate the variation in rater severity within rating sessions (which we called rater drift), and (c) to examine the relationship (or lack thereof) of the IRT rater severity information to traditional information based on expert re-ratings. We found that it was indeed feasible to give feedback at half-day intervals, that this information was interpretable by the raters, that there was considerable variation in rater severity within the scoring session, and that the IRT information was not redundant with the traditional information. We also investigated the impact of the estimated rater severities using an IRT approach, and showed that they accounted for about half of the residual misfit. In addition, we found that table leaders (expert raters assigned to lead small groups of raters) exhibited more variation in rater severity than the regular raters. This observation may have as much to do with the somewhat odd sample of student work that they examine as with any systematic pattern of ratings on their part. We investigated the effect of table leaders' severity on the raters they were leading, and found that there was no systematic

Similar resources

Test-Retest and Inter-Rater Reliability Study of the Schedule for Oral-Motor Assessment in Persian Children

Objectives: Reliable and valid clinical tools to screen, diagnose, and describe eating functions and dysphagia in children are highly warranted. Today most specialists are aware of the role of assessment scales in the treatment of affected individuals. However, the problem is that the clinical tools used might be nonstandard, and worldwide, there is no integrated assessment performed to assess ...

A Study of Raters’ Behavior in Scoring L2 Speaking Performance: Using Rater Discussion as a Training Tool

The studies conducted so far on the effectiveness of resolution methods including the discussion method in resolving discrepancies in rating have yielded mixed results. What is left unnoticed in the literature is the potential of discussion to be used as a training tool rather than a resolution method. The present study addresses this research gap by exploring the data coming from rating behavi...

Evaluation of Spasticity Using the Ashworth Scale with Intermediate Scores (ASIS)

Objectives: The main purpose of this research was to study and contribute to an accurate test of limb spasticity. The intra- and inter-rater reliability of the test was examined. Methods: The present study was carried out in two parts. In the first part of the study, the modified Ashworth Scale with Intermediate Scores (ASIS) was studied. During the second part of the study, the intra- and inter-rater re...

Assessing the Reliability of Radiologists and Their Performance in Diagnosing the Severity of Ovarian Masses from Ultrasound

Background: Intra-rater agreement in observation and decision making is of great importance in the diagnosis of any disease. This investigation concerns how radiologists observe and read ultrasound images of ovarian cysts and distinguish their category. Distinguishability is one of the related quantities in this matter, and radiologists' ability to diagnose correctly is of great concern. In this study...

Detecting differential rater functioning over time (DRIFT) using a Rasch multi-faceted rating scale model.

This paper describes a class of rater effects that depict rater-by-time interactions. We refer to this class of rater effects as DRIFT (differential rater functioning over time). This article describes several types of DRIFT (primacy/recency, differential centrality/extremism, and practice/fatigue) and Rasch measurement procedures designed to identify these types of DRIFT in rating data. These pr...

Publication date: 2000